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4 1 Abstract 

5 The gradual loss of diversity associated with range expansions is a well known pattern observed in many 
e species, and can be explained with a serial founder model. We show that under a branching process 
? approximation, this loss in diversity is due to the difference in offspring variance between individuals 
s at and away from the expansion front, which allows us to measure the strength of the founder effect, 

9 dependant on an effective founder size. We demonstrate that the predictions from the branching process 

10 model fit very well with Wright-Fisher forward simulations and backwards simulations under a modified 

11 Kingman coalescent, and further show that estimates of the effective founder size are robust to possibly 

12 confounding factors such as migration between subpopulations. We apply our method to a data set of 
o Arabidopsis thaliana, where we find that the founder effect is about three times stronger in the Americas 

14 than in Europe, which may be attributed to the more recent, faster expansion. 

15 2 Introduction 

16 We may think of a range expansion as the spread of a species or population from a narrow, geographically 

17 restricted region to a much larger habitat. Range expansions are a common occurrence in many species 
is and systems, and they happen on time scales that differ by orders of magnitude. Viruses and bacteria may 
is spread across the globe in a few weeks Brockmann and Helbing (2013), invasive species are able to colonize 

20 new habitats over decades (Davis, 2009); and many species migrated into their current habitat over the 

21 last few millenia, following changing environments such as receding glaciers (Hewitt, 1999; Taberlet et al., 

22 1998). 
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23 The population genetic theory of range expansion is based on two largely distinct models. The first 

24 model, based on the seminal papers of Fisher (1937) and Kolmogorov ct al. (1937) and often called 

25 the Fisher-KPP model, is based on the diffusive spread of alleles, and has been mostly explored from 

26 a statistical mechanics viewpoint. The other model, called the serial founder model, has its roots in 

27 the empirically observed decrease in genetic diversity from an expansion origin Austerlitz et al. (1997); 
2s Hewitt (1999); Ramachandran ct al. (2005). 

29 The Fisher-KPP partial differential equation describes the deterministic change in allele frequency 

30 at a spatial location due to the individuals with a selected allele having more offspring than wild-type 

31 individuals. The model can be applied to range expansions by substituting the selected allele with 

32 presence of a species, and the wild-type allele as absence of a specics(see e.g. Barton et al., 2013, for a 

33 recent review). Its solution is a travelling wave; similar to logistic growth , populations grow to some 

34 carrying capacity. This model has received more widespread attention recently due to the empirical 

35 tests by Hallatschek et al. (2007), who compared growing colonies of E. coli to the predictions from the 

36 Fisher-KPP equation. 

37 However, it is also apparent that some of the predictions of the Fisher-KPP model are inconsistent 
3s with many macroscopic systems. In particular, the Fisher-KPP model predicts that local populations 

39 start with extremely small population sizes. This leads to a very high amount of genetic drift, and, 

40 for example, in the experiments of Hallatschek et al. (2007), all local genetic variability was quickly 

41 eliminated, so that no polymorphisms were shared between individuals sufficiently far from each other. 

42 This is in strong contrast to humans, for example, where an expansion out-of-Africa is well supported 

43 (e.g. Ramachandran et al., 2005), but where we find many genetic variants shared between all human 

44 populations. This was part of the motivation for the development of the serial founder model, which 

45 is typically based on a variant of a stepping stone model (Kimura, 1964) in one or two dimensions. 

46 Typically, only a small subset of populiations is colonized at the beginning of the process, but over time 

47 subsequent populations are colonized, usually by means of some founder effect. There are multiple ways 

48 a founder effect has been modelled: Austerlitz et al. (1997) and Ray et al. (2010) chose to model the 

49 founder effect by local logistic growth, where each local subpopulation grows according to the logistic 

50 equation. A simpler model, favored by DeGiorgio et al. (2011) and Slatkin and Excoffier (2012), is to 

51 model the founder effect by a temporary reduction in population size. 

52 A complete serial founder model, including selection, founder events and migration between subpop- 
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53 ulations, has, so far, be proven to complex to be of use for analytical research. However, a recursion 

54 approach (e.g. Austerlitz, 1997) and simulations (e.g. Klopfstcin et al. 2006, Travis & Burton 2010) have 

55 been successfully applied to investigate the behavior of the model. Other alternatives are models that do 

56 not model expansions cxplicitely, and make additional simplifications. Perhaps the simplest model of this 

57 kind is that of a demographic expansion without any spatial component, which can be fully described by 

58 a change in the rate of coalescence Gravel et al. (2011); Li and Durbin (2011), with the assumption made 

59 that the population is panmictic throughout the expansion. This model resembles a spatially explicit 
eo expansion, when migration between demes in the latter are very large. A more sophisticated model, 

61 incorporating spatial structure, is the infinite island approximation model of Excoffier (2004). In this 

62 model, an originally small, panmictic population expands instantly into a mctapopulation with a large 

63 number of subpopulations. In contrast to the demographic expansion, in this model we can compare coa- 

64 lescent time distribution between demes and within demes (called the mismatch distribution), which has 

65 been used for inference previously. However, this model assumes that all subpopulations are exchangable, 

66 so that there is no difference in coalescence times with individuals at a wave front, when compared to 

67 individuals in the center of a population. A further step were the models of DeGiorgio et al. (2011) and 
6s Slatkin and Excoffier (2012). DeGiorgio et al. (2011) derived coalescence times under a serial founder 

69 model, using a small bottleneck as a founder event. In the model of Slatkin and Excoffier (2012), the 

70 expansion is modelled as a spatial analog of genetic drift, where each founder event corresponds to a 

71 generation in a standard Wright-Fisher model. 

72 So far, the theoretical models of range expansions have let to few applications that can be applied 

73 to interpret genetic data from non-model organisms. In this paper, we first develop a simple model of a 

74 range expansions based on a branching process approximation. The advantage of the simplicity of the 

75 model is that it leads to the development of an intuitive understanding of an expansion. We test the 

76 model using simulations, and discuss its limitations, and then show how it can be used for inference. 

77 We demonstrate the utility of our approach by re-analyzing SNP data of the model plant species 

78 Arabidopsis thaliana (L.) Heynh. (Horton et al., 2012). A. thaliana is a small, annual plant, thought 

79 to be native to Europe, but introduced in North America and other locations (Jorgensen and Mauricio, 
ao 2004). The biogeograhy and population structure of A. thaliana has been well studied (Horton et al., 
si 2012; Jorgensen and Mauricio, 2004; Nordborg et al., 2005) While earliest studies showed relatively little 
82 population differentiation on a global scale, genome-wide genetic data supports widespread population 
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structure and clear genetic differentiation between populations (Horton et at., 2012; Nordborg et al., 
2005). The availability of genome-wide SNP-chip data from more than a thousand individual plants from 
hundreds of locations make A. thaliana an ideal test case for the genetic signatures of range expansions. 
However, the status of A. thaliana as a human commcnsialist and the fact that A. thaliana is a selhng 
plant, may make the analysis more challenging. 

3 Results 

3.1 Overview of theoretical results 

In this section, we will briefly outline our model and the main theoretical results. Details and full 
derivations can be found in the appendix. A schematic of the model studied is given in Figure 1. In 
brief, we assume a serial founder model on a one dimensional stepping stone grid, where initially only 
one deme is colonized. We compare the allele frequency of individuals in the same location as the origin 
of the population at time t, X t , with individuals at the wave front at time t, which we denote by X t . In 
particular, we are interested in the difference in derived allele frequency between the population at the 
starting position and the expansion front, which we denote as Z t . In Appendix A.l, we show that the 
expected difference in allele frequency is 



where fo is the initial frequency of an allele, and L(t) and L(t) are the probabilities that an allele is 
lost by time t at the origin of the expansion and the wave front, respectively. 

We can make this result more explicit assuming the populations evolve according to a branching 
process. A (Galton- Watson) branching process (Harris, 1954) models the evolution of a population 
by assuming that all individuals produce offspring independently from each other, with some offspring 
distribution F. In Appendix A. 2 we use standard results from branching process theory to show that if 
each deme evolves according to a branching process, then (1) can be written as 



that is, the difference in allele frequency is expected to increase linearly with distance, and the slope is 




(1) 




(2) 
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half the difference in the variance of offspring distribution at the expansion front Vax(F) and away from 
the expansion front Var(i^). Since we assume that founder effects occur at the expansion front, we expect 
it to have a higher offspring variance, corresponding to a lower effective population size. It is worth 
pointing out that the term of order t in EZ t docs not depend on /o, so that we expect the same slope 
independent of the initial allele frequency. As the higher order terms depend on /o, we will examine the 
accuracy of this result using simulations. 

In Appendix section A. 3 we then use the offspring variance from a Wright-Fisher model to define an 
effective founder size k e , and show tat 

= 5 (t 1 ~ A*- (3) 



2 V K 

114 where N e is the effective population size of a demc. In some cases, it might be possible to interpret N e 
us and k e directly. For example, if wc think of a species colonizing a system of islands, N e corresponds to 
ne the carrying capacity of that species on a given island, and k e to the number of founders. In most cases, 
n7 however, subpopulations are not clearly defined and the population is relatively continuously distributed, 
us Under these circumstances, it is not clear what N e and k e represent. Therefore, we show in section A. 4 
us that it makes more sense to think about the distance over which the ratio jf- has a certain value (e.g. 

120 0.99), that is, how far apart demes need to be so that each founding population is 1% lower than the 

121 population at equilibrium. The larger this distance, the weaker the founder effect. 

122 Finally, in Appendix A. 5 we show how we can estimate EZ t from genetic data using the ip statistic 

123 defined as 

*= f h i; h : f (4) 

J12 + .111 + J21 

124 where is the entry in the allele frequency spectrum, tp was introduced by Peter and Slatkin 

125 (2013), and we show in Appendix A. 5 why ip is an useful estimator of ¥,Z t . Taken together, these results 

126 suggest that we can define and estimate the effective founder effect that describes the loss of genetic 

127 diversity with distance from the expansion origin, and that we can infer the strength of the founder effect 
i2s using a simple linear regression on the allele frequency of shared alleles. 
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129 3.2 Simulations 

130 We validate our analytical results by performing extensive simulations under two different models. The 

131 first is the forward-in-time model stepping stone model described by Slatkin and ExcofBer (2012). The 

132 second model is a backward-in-time stepping stone model, based on the Kingman coalescent (Wakeley, 

133 2009). 

134 3.2.1 Forward simulations 

135 We first validate our results using a discrete-time, forward-in-time Wright-Fisher model. In Figure 2, we 

136 give results for various initial allele frequencies /o, setting k e = 0.1N (first row), k e — 0.5N (middle row) 
13? and k e = Q.9N (bottom row). Using equation 3, we would predict Z t to be 4.5i, 0.5t and i/18, respectively, 
us Those predictions are given by the red lines; the points represent data observed in simulations. We find 
139 that we get better estimates when i) the effective founder size is low, ii) the time after the expansion 
wo is low and hi) the effective population size is high. In particular, we find that we get a systematic bias 

141 when we have a very strong founder effect, and thus allele frequency differences are expected to be very 

142 large. In that case, many alleles will become fixed in the population, and the predictions between the 

143 Wright-Fisher and the branching process models are quite different. 

144 In Figure 3, we investigate the effect of demes growing to their carrying capacity via logistic growth, 

145 as opposed to instantaneous growth which we assume in most other cases. Here we can apply the result 
«6 that under non-constant founder population sizes, the effective founder size is simply the harmonic mean 
147 of all founder sizes, divided by the number of generations. In Figure 3, simulations were performed with 
us a carrying capacity of 10,000, and growth starting 

149 3.2.2 Backwards simulations 

150 We also performed backward-in-time simulations i) to test the robustness of the branching process pre- 

151 dictions to migration, ii) to test the effect of estimation from a subsample and iii) to to remove the initial 

152 allele frequency as an explicit parameter. Coalescent simulations were performed in a continuous-time 

153 model with discrete expansion events. In particular, most of the time lineages are allowed to merge 

154 according to the standard structured coalescent. The only exception are the expansion events, which are 

155 modelled as a single generation of Wright-Fisher mating, followed by moving all lineages in the newly 

156 colonized deme back to the founder deme. Thus, unlike the Kingman-coalescent, this model allows for 
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multiple mergers at the wavefront. Under this model, 

since the founder effects result in an increase of the offspring variance by a factor of (2k e )~ 1 . We estimate 
¥,Zt using the ^-statistic defined in (Peter and Slatkin, 2013), justification of this is given in Appendix A. 5. 
Results are displayed in Figure 4. In the top row, we show samples taken immediately after the expansion 
reached the boundary of the habitat, in the bottom row, we show samples that were taken a very long 
time (207V generations) after the expansion finished. We find that recent expansions are detected rather 
easily, almost independent of the migration rate, and the effective founder size is estimated with high 
accuracy. In the bottom row, we observe that for intermediate migration rates (M = 0.1 and M = 1), we 
still get a relatively accurate result, however, we have more noise, indicating that much larger samples 
would be required to obtain confident estimates, since most SNP will be either fixed or lost after this 
time. For a low migration rate, we see that we do not have any power for inference, since individuals 
all coalesce within their demes before they have the opportunity to coalesce with lineages from other 
locations. For high migration rates (M = 100), we find that the signal of the range expansion has almost 
vanished. Under these conditions, migration is so strong that the population essentially resembles an 
equilibrium isolation-by-distance population, and the signal of the range expansion has been lost. For 
M = 10, we see an intermediate behaviour, close to the origin we are at equilibrium, but far away the 
slope of the curve is still the same as we would expect under an expansion. 

3.2.3 2D simulations 

In addition to the results presented in the previous two sections, where we performed simulations in a one- 
dimensional habitat, we also performed simulations on 2D-stepping stone model to investigate the impact 
of multiple dimensions. We performed simulations by simulating expansions both under a migrant pool 
model and a propagule pool model (Slatkin and Wade, 1978). In a migrant pool model, all neighbouring 
populations send migrants at equal rates, whereas under the propagule pool model, one possible founder 
population is selected to send out a "propagule", which colonizes the new deme. We find that if the 
sample axis is parallel to the orientation of the stepping-stone-grid, then the migration model docs not 
matter, and we get the same behavior as in the ID case. However, if we sample a diagonal we find 
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183 that the results from the ID-simulations are applicable under the propagule pool model, but not under 

is4 the migrant pool model (Figure 5. The reason for that is quite simple: under the migrant pool model, 

185 there are many different paths on how a deme can be colonized, and the number of paths increases with 

186 distance from the origin, which reduces the amount of drift in a non-linear way with distance from the 
is? origin. In contrast, under the propagule pool model, always one path is chosen. We also find that under 
las the 2D-model the signal of the expansion disappears faster. Whereas in the ID-model at a migration 

189 rate of M = 1 the expansion is still detectable after 20N generations, we find that at the same migration 

190 rate, the population already approaches equilibrium. 

191 3.3 Application to A. thaliana 

192 We applied our model to the SNP dataset of Horton et al. (2012). Based on a PCA analysis (Figure 6a) 

193 and the sample locations, we defined five regions for further analysis: Scandinavia, Americas, as well as 

194 Western, Central and Eastern Europe. 

195 In the Americas, (Figure 6c) we find a most likely origin in the Great Lakes region, as opposed to 

196 the East coast. This might be somewhat surprising, but as we only have one sampling location at the 
19? East cost, power might be low. However, we can speculate that traders through the Great Lakes first 

198 introduced A. thaliana to the US, which may explain the signal. This may be seen as evidence for a 

199 recent introduction of A. thaliana into the Americas from Europe. Further support for this hypothesis 

200 comes from the fact that the American samples cluster togethr with the Western European samples in 

201 the PCA analysis, and from the fact that the North American population shows the strongest founder 

202 effect, with a 1% founder effect every 4.26 km. 

203 Scandinavia (green dots in Figure 6a, Figure 6c), shows the most diversity within a region according 

204 to the PCA plot, and the second highest founder effect size. The most distinct samples, in the bottom left 

205 of Figure 6, are from Northern Sweden and Finland, whereas those samples that cluster with the Central 

206 and Eastern European Accessions are predominately found in Southern Sweden. We find evidence of 

207 immigration from the East, with the most likely origin of the Scandinavian accessions lying in Finland. 

208 Based on the PCA analysis, we might expect the accessions from Southern Sweden to show evidence of 

209 a range expansion from the South, and that is indeed the case when we only consider these Southern 

210 samples. 

2u If we analyze these Southern Scandinavian samples together with the samples from Austria, Czech 
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212 Republic, Russia, Lithuania and Tajikistan (Figure 6d, pink and brown dots in the PCA), we find evidence 

213 of an expansion out of eastern Asia, possibly from a refugium close to the Caspian Sea. For the Central 

214 European samples, we find an origin close to the border between Austria and Italy. This is likely a 

215 proxy for a refugium in either Southern Italy or the Balkan region, as the inferred origin was covered by 

216 an ice sheet during the last glacial maximum. Finally, for the Western European samples we find the 

217 weakest founder effect among all analyzed region, with a 1% founder effect at a scale of 38.6 km, almost 
2is an order of magnitude weaker than the strongest founder effect we observed in this set of populations, 

219 in the Americas. This is however partly due to the aggregation of the British and continental samples; 

220 if we just analyze the French, Spanish and Portugese samples (excluding the British samples), we find 

221 a founder effect of 18.7, in line with the other continental European regions. In contrast, if we analyse 

222 the British samples separately, we estimate an 1% decrease to occur over 47.8 km, and in fact we cannot 

223 exclude equilibrium isolation by distance, as, after Bonfcrroni correction, asp > 0.05. 

224 4 Discussion 

225 In this paper, we study range expansions using a serial founder model, with the main goal to develop 

226 inference procedures. We use a branching process approximation to approximate the decay of genetic 

227 diversity due to the recurring founder effects. We use this approximation to define an effective founder 
22s size, which can be estimated using standard linear regression from genetic data. 

229 A linear or approximately linear decline of genetic diversity with distance has been observed previously 

230 in humans (DcGiorgio et al., 2009; Ramachandran et al., 2005) and in simulations (DcGiorgio et ah, 2009; 

231 Peter and Slatkin, 2013). In previous work, we showed, using simulations, that the directionality index 

232 ip, defined in equation (4), increases approximately linearly with distance (Peter and Slatkin, 2013). In 

233 this paper, we connect these empirical observations with a theoretical model, that explains this decay in 

234 terms of differences in offspring variance. This is justified because in populations with a higher offspring 

235 variance, genetic drift occurs faster and therefore the population's effective size becomes smaller. 

236 While branching processes have a long history in population genetics (Ewens, 2004), they differ from 

237 other commonly used models such as the Wright-Fisher model and the Coalcsccnt in that the total 

238 number of individuals in the population is not constant (or following a predetermined function). Instead, 

239 the expected number of individuals is constant, leading to different dynamics. For example, a neutral 
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240 branching process will eventually die out almost surely, something that cannot happen under the Wright- 

241 Fisher model. Therefore, the models presented here are only useful in parameter regions where the 

242 branching process model and other population genetic models result in similar dynamics. For example, 

243 our model breaks down if there are only few shared variants between populations. However, in this 

244 case, phylogeographic methods are arguably more appropriate than population genetic ones. Otherwise, 

245 our model appears to be useful as long as a significant fraction of variants has a most recent common 

246 ancestor during the expansion or before the expansion started. If that is not the case, as in the simulations 

247 with high T and high M in Figure 4, we find that ip will be very close to zero, due to the signal of the 

248 expansion vanishing over time. The last parameter region where the model breaks down is when the mean 

249 allele frequencies become very large (Figure 2. In that case, the increase slows down due to fixations in 

250 the Wright-Fisher model, whereas it may further increase under the branching process approximation, 

251 explaining the difference. 

252 Similar to the effective founder size defined in Slatkin and Excoffier (2012), the effective founder size 

253 k e we defined here is a variance effective size. This is different from the model proposed by DcGiorgio 

254 et al. (2009), where an explicit bottleneck was used to model a founder effect. Using an effective size is less 

255 specific than that - there are many models that will lead to the same founder size - but has the advantage 

256 that the same formalism can be applied to many different situations. We also showed various rescaling 

257 properties. Perhaps counterintuitivcly, EZ t is largely independent of the expansion speed, conditional on 
25s k e . The reason for that is that, even though more segregating variants will be lost in a faster expansion, 

259 the difference between the expansion front and the rest of the population remains the same. Similarly, 

260 waiting after the expansion finished will not change EZ t . 

261 Of course, an effective size has its limitations, as in essence, it is just a measure of the speed of 

262 genetic drift. Many other models exist that may lead to similar or identical patterns of genetic diversity 

263 (DcGiorgio ct ah, 2009). However, in many cases it is natural to assume a range expansion occurred, 

264 often through climatic or historical evidence. In these cases, our framework may provide a starting point 

265 for a genetic analysis. 

266 The analysis of the A. thaliana data shows both the usefulness and some of the limitations of our 

267 approach. We are able to identify expansion origins and infer the strength of the founder effect from 

268 genetic data. In the A. thaliana data set, we find that the founder effect is much stronger in the 

269 Americas than in continental Europe. This is an interesting pattern, and it would be very interesting to 
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270 see if the same is true for other introduced or invasive species. In Europe, our results are consistent with 

271 previous analyses by Nordborg et al. (2005) and Frangois et al. (2008). Nordborg et al. (2005) found 

272 that Arabidopsis likely colonized Scandinavia both from the East, through Finland, and from the South. 

273 The strong population structure is consistent with this finding a global pattern of an Eastern origin, and 

274 evidence for immigration from the South when just analyzing the Southern Swedish samples, or if we 

275 jointly analyze them with Eastern European and Asian samples. Overall, we identify a likely ice-age 

276 refugium close to the Pyrenees in Southern France or Eastern Spain, a likely refugium near the Caspian 

277 sea and a refugium in central Southern Europe, cither in the Balkan or Italy, where denser sampling is 
27s required for a more accurate picture. In the Americas, we find that Arabidopsis experienced very strong 

279 founder events, and we identify a most likely point of introduction near the Great Lakes. 

280 On the other hand, describing the founder effect as a distance over which genetic diversity decreases 

281 by a certain amount is not as satisfying as is the inference of an effective founder size, on the same scale 

282 as the effective population size. However, it is necessary because of scaling reasons; if a single population 

283 spans a larger area, then we necessarily need a strong founder effect to get the same diversity gradient 

284 than. On the other hand, if we subdivide the area of the large population into smaller populations, each 

285 of those will have its own, smaller founder effect, but the population will experience a larger number of 

286 founder events. Thus, if we know the scale of a local population, or can reasonably approxiamte it (e.g. 

287 if we know the dispersal distance of the species). We can obtain an estimate on how much lower the 

288 founder size is compared to the effective size at carrying capacity in equilibrium. On the other hand, 

289 interpreting the founder effect as a distance allows us to obtain a measure that is independent of how 

290 populations actually occupy space, which is more versatile, but somewhat harder to interpret. 

291 5 Methods 

292 5.1 Forward WF-simulations 

293 Forward simulations were performed using a simple simulator implemented in R. Simulations were started 

294 with a fixed initial frequency /o, and allowed to evolve for a fixed number of generations. Every t — 1-th 

295 generation, the rightmost deme founded a new population, first with a single Wright-Fisher generation 

296 of size fc e , which then, in the t-th generation, expanded to size N. All demes except the newly founded 

297 one underwent t generations of Wright-Fisher mating in the same time frame, thus after gt generations, 
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g demes are colonized. ¥,Z t was estimated from 10 6 replicate alleles. More complex models were imple- 
mented the same framework, i.e. we added migration between all demes at each generation and allowed 
the population to evolve for additional generations after the expansion finished. We also used a modifica- 
tion that allowed for changes in population size after each expansion event, and we used this modification 
to study the effect of logistic growth (see Figure 3) . 

5.2 Backwards simulations 

Backward-in-time simulations were performed using the standard structured coalescent model Wakeley 
(2009), with a minor modification. The structured coalescent allows easy inclusion of migration events, 
but coalescence include migration and colonization events. The coalescent is usually studied in the 
continuous limit where the number of generations and population sizes are both very large. We follow 
this approximation with the exception of expansion events, which are modelled using a single generation of 
Wright-Fisher-mating. Backward in time, we stochastically merge lineages, the the backwards-transition 
probability for the number of lineages is (Watterson 1975, Wakeley 2008, p. 62): 

? U} k r -i 

P(L t + l=j\L t =i-k e ) = ^^, (6) 

where k e is the effective founder size, L t is the number of lineages at time i, (time measured backwards in 
time in coalescence units), S± is the Stirling number of the second kind and iVui is the jth falling factorial. 
If the number of lineages is reduced, we merge lineages uniformly at random. All remaining lineages are 
then transported to a neighbouring colonized deme. To compare this model to our predictions from the 
branching process model, we have to consider the excess variance in offspring distribution resulting from 
these expansion events, which is j^-, such that for this coalescent model 

H^ = i t + 0 (i). (7) 

Thus, the smaller the effective founder size k e , the larger the allele frequency gradient will be. 1D- 
and 2D-simulations were performed using the same simulator. For ID-simulations, we sampled eleven 
samples with n lineages every 5th deme, with 20 additional domes to avoid boundary effects. For the 
2D-simulations, we sampled both a diagonal and horizontal transect. The horizontal transect, parallel to 
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321 the demic structure, had length 30. The diagonal transsect, where demes were colonized every time 

322 units, had length 20\/2 , so that both transect are colonized in approximately the same time. 

323 5.2.1 Application 

324 The data set of Horton et al. (2012), along with the coordinates for the accessions was downloaded 

325 from the project's website at http://bergelson.uchicago.edu/regmap-data/. Genotypes of the sister 

326 species Arabidopsis lyrata provided by Matthew Horton were used to determine the ancestral state for 

327 each SNP. SNP where we could not determine an ancestral state unambiguously, cither because no 

328 homolog A. lyrata allele was found, or the allele A. lyrata was not present in A. thaliana, were removed. 

329 Similarly, we removed all individuals where we did not have sampling coordinates. Since A. thaliana is 

330 a selfing plant and highly inbred accessions were sequenced, we only had a single haploid genotype per 

331 individual. Since our methodology requires at least two sampled haplotypes, we restricted our analysis 

332 to locations with at least two accessions sampled. To avoid bias due to very closely related accessions, we 

333 subsequently removed locations where the plants differed at less than 1.5% of sites (average heterozygosity 

334 of locations was 7.1%, with a standard deviation of 3.2%). This resulted in a total of 149 locations 

335 with at least two samples, representing 855 individuals, with 121,412 SNP genotypes remaining. As a 

336 single, uniform expansion throughout Europe seems rather unlikely, we performed a PCA analysis to find 

337 the main axes of population differentiation (Figure 6a). As the resulting pattern divided the samples 
33s broadly into four different groups, we analyzed data from these groups seperately. These groups are: 

339 Americas (black), Western Europe (blue), Central Europe (red) and Scandinavia (green). For each of 

340 these groups, we estimated the origin of the range expansion using equation 5 of Peter and Slatkin (2013). 

341 For visualization, we evaluated eq 5 of Peter and Slatkin (2013) on a grid (with locations not falling on 

342 land excluded), and estimated the best fit for the slope parameter (v) using linear regressions, with the 

343 location with the highest r 2 corresponding to the least squared estimate of the origin of the expansion 

344 (Figure 6b-f). 

345 The expected value of tp depends on the ratio of the effective founder size k e to the effective population 

346 size N e and the number of demes that the population colonized. The number of demes is relevant, since if 

347 we subdivide the population into more demes, it will undergo more (but weaker founder effects) over the 

348 same physical distance, or conversely, if we assume that demes arc large, then we have few founder events 

349 with a very strong founder effect. Using the simple model developed in this paper, we cannot distinguish 
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350 these cases without additional extraneous information. For example, we may fix the size of each deme 

351 based on extraneous information. For example, if the mean dispersal distance is known for a species, 

352 we may assume that the spatial extent of each deme is approximately that dispersal distance, and we 

353 can calculate k e relative to that quantity. In this context, the ratio k e /N e has the interpretation as the 

354 percentwise reduction in Wright's neighborhood size. Alternatively, if the dispersal distance is unknown, 

355 we may fix the ratio r = k e /N e to an arbitrary constant, and instead report the required distance x e over 

356 which the effective founder size is k e . This has the advantage that it provides us with a quantity that is 

357 independent of assumptions of the demic structure, and the larger x e is, the weaker is the founder effect 
35s of the population. For illustration purposes, we calculate the ratio k e /N e for deme sizes of 1km, 10km 

359 and 100km, as well as x e for all groups and report them in Table 1. 
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4is A Derivation of main results 

419 A.l Discrete time expansion model 

420 We model a range expansion on a one-dimensional stepping stone model with potential deme positions 

421 0, 1, 2, . . . labelled di, i = 0, 1, ... . All but deme do are not colonized at the start of the process. We 

422 denote the frequency of an allele of a biallelic marker in deme di at time t as fi(t), and we assume that 
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423 jo(0) = /o) where fo is some constant. The population behaves as a Markov process, so that the allele 

424 frequencies at time step t only depend on step t — 1. Each time step, genetic drift will change allele 

425 frequencies according to some probability distribution. In addition, deme dt will become colonized by the 

426 offspring of individuals present at time t— 1 in dome d t -\ according to some other probability distribution. 

427 For simplicity, we at first assume there is no migration between demes, and test the robustness to this 

428 assumption using simulations. 

429 Let {X t } = {./b(0), /o(l), • ■ ■ fa(t)} and {X t } = {/o(0), /i(l), . . . f t (t)} be the processes at and away 

430 from the wave front. Since we disallow migration, we can describe the history of "intermediate" demes 

431 di, 0 < i < t by processes {X^} = {/o(0), /i(l), • • ■ fa{i), fi(i+l) . ■ ■ fi(t)}- In words, demes are colonized 

432 when the wave front first reaches them, and the subsequent evolution depends only on the allele frequencies 

433 at the time when they first evolved. From this construction, it follows that for i < j, {X^} and {X^} 

434 are conditionally independent given fi(i). Together with the Markov property this implies that the 

435 difference in allele frequency in two demes is a function of distance, i.e. they obey 

F(X® t xP\Mi)) = F(X t ^,X t ^\f 0 ). (A.l) 



436 Throughout this section, we assume that EX t |Ao = /o is constant, which is satisfied if there are new 

437 new mutations and no selection, and we further assume that Var(At) < oo. For example, for the critical 
43s branching process model we introduce in the following section, Var(X t ) = at, where a is the offspring 
439 variance in one generation. Then the autocovariance for s < t is, 



Cov(X s ,X t ) - Var(X s ), (A.2) 

440 and similarly for X, because {X t },{X t } are martingales. 

441 Next, we define the conditioned processes {Y t } = {X t \X t > 0} and {y}{X t |X t > 0} which give the 

442 allele frequency conditional on the allele not being lost. 

443 Then, we have that 

_, E.Y/ EA'/ . , . 

^Y t = — — = — (A.3) 

* P(X t > 0) 1 - L(t) y ' 

444 since 

EX = E(X\X > 0)¥(X > 0). (A.4) 
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445 Here, L(t) = P(X t = 0) denotes the probability that an allele is at frequency zero in generation t, and 

446 we remove the dependency of L(t) from fo for notational convenience. 

447 Using the conditional variance formula, we can compute the variance and autocovariance of {Y t }: 

Va.ffl, = ^L-(^m' (A,, 



1-L(t) \1-L(t) 
^^+L(t)(EY t f (A.6) 



and covariancc for s < t 



Cov(r s ,F t ) - EY s Y t -EY s EY t (A.7) 

= E(X s X t \X t > 0) - E(X S \X S > 0)E(X t \X t > 0) (A.8) 
E(X s X t ) ft 



P(X t > 0) P(X S > 0)¥(X t > 0) 
Var(X s ) , r , . f ( 



(A.9) 



2 



" I -Lit) ' L{s \l-L( S ))(l-L(t)) (A - 10) 

449 The last quantity of interest is the difference Z t = Y t — Y t , which gives the difference in allele frequency 

450 between the wavefront and the origin of the expansion, conditional on an allele surviving in both locations. 

451 We find that 

EZ t = /of 1 — T 1 ) (A.ll) 

JQ \1-L(t) 1-L(T)J 1 ' 



Var(Zt) = Var(r 4 )+Var(r t ) (A.12) 
Var(X t ) Var(A > t ) 



1 - L(t) l - L(t) 

and 



L(t)EY t + L(t)EY t (A.13) 



Cov(Z s ,Z t ) = Cov(Y s ,Y t ) + Cov(Y s ,Y t )-Cov(Y s ,Y t )-Cov(Y s ,Y t ) (A.14) 
Var(A s ) (1 - L(s)) + f$L(s) Var (^) (l ~ L ( s )) + foH s ) 



(1 -£(«))(!- L(t)) (I - L(s))(l - L(tj) 



(A.15) 
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454 A. 2 Branching process 

455 To further specify the moments derived in Appendix A.l. we need to define Var(AT s ), L(s) and / 0 , and the 

456 corresponding quantities at the wave front. This is particularly easy using a Galton- Watson branching 

457 process. Under this model, each generation individuals leave offspring independent from each other 
45s according to some offspring distribution F. Let Li(t) denote the probability that an allele has been lost 

459 by generation t, given that it started with i copies in generation 0. Kolmogorov (1938) showed that when 

460 t is large, L\ is well approximated by 

^^-ndW (A - 16) 

461 where F is the offspring distribution and V&rF is assumed to be finite. We assume that a branching 

462 process with offspring distribution F describes neutral genetic drift at the wave front, and that the 

463 colonization of new domes occurs according to a branching process with offspring distribution F. 

464 If the initial frequency fo of the allele is greater than one, the corresponding expression becomes 

L So {t) = {Li{t)) } \ (A.17) 

465 by independence of individuals. Using a Taylor expansion around t = oo yields 

^Z t = /o( i J-pr) (A.18) 

J °\l~L f0 (T) l-L f0 {t)) 
= 2 (Var W -Var(^ ^ + o ( g ) (A.19) 

466 Thus, we find that the expected difference in allele frequency between the expansion origin and the front 

467 of the population increases approximately linear with distance, the slope of the curve being the difference 
46s in offspring variance of individuals at the wavefront and expansion origin. From the second term in 

469 the Taylor expansion we see that the approximation is suitable when t > f§, i.e. the number of domes 

470 between the two samples is large, and the frequency of the allele at the founding location is small. 
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471 A. 3 Effective population size 

472 The variance effective population size for a Cannings model is defined as 

N - = V^F) (A - 20) 

473 , where N is the absolute number of individuals per population. The branching process considered above 

474 is not a Cannings model, however, the evolution of the offspring of a single individual under a Cannings 

475 population is well approximated by a branching process, as long as that offspring only makes up a small 

476 fraction of the individuals in a population. Fisher(1930) pioneered the modelling of population genetics 

477 using branching processes (Ewens, 2004, p. 29). Under a Wright-Fisher model, the offspring distribution 
47s of a single individual has mean and variance very close to one. This justified Fisher to approximate the 
479 evolution of an individual under the Wright-Fisher model as evolving according to a branching process 
4so with a Poisson(l) offspring distribution, which has offspring variance 1, a model we will also use here to 

481 model genetic drift away from the wave front. 

482 To incorporate the reduced effective size of a founder effect at the wave front, we use a modified 

483 offspring model: with probability (1-a), an individual at the wavefront does not produce any offspring. 

484 With probability a, the number of offspring is Poisson distributed with parameter 1/a s.t. the overall 

485 expected number of offspring is still one and the variance is Vwc(F) = a^ 1 . This allows us to define an 

486 effective founder size k e 

k e = aN, (A.21) 

487 which measures the "increase" in genetic drift at the wave front. 

488 Combining eq. A.21 and eq. A. 19 yields 

EZ t = U^-l)t (A.22) 



2 V k 



(A.23) 



489 From this, we see immediately that EZ t = 0 only if N e = k e , and also that the effective founder size 

490 enters the equation only in the ratio n = 77 , so that it makes sense to further define the relative founder 

491 size K, which measures the strength of the founder effect. 
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492 A. 4 Rescaling 

493 The branching process we used above assume that exactly one generation of genetic drift happens between 

494 each founder event. In this section, we show that the expected allele frequency difference between the 

495 expansion front and at the origin is (i) invariant of additional generations between expansion events and 

496 (ii) invariant to additional generations after the expansion finished. 

497 Both follow from the fact that for a branching process with mean 1, the variances of subsequent 
49s generations can simply be added: Consider the generating function of a critical branching process B 

499 after t generations, denoted by pt{s) which has variance p t (l)" . Then, after an additional generation, the 

500 generating function becomes q(p t (s)), where q(s) is the generating function of the offspring distribution 

501 of that additional generation. Then, the variance in offspring after this additional generation is q(p t (l))" . 

Var(B) = (p' t (lfq"(p t (l)) + q '(p t (l))p'((l) = q"(l) +p t (l), (A.24) 



502 since p' t (l) = Pt (l) = q(l) = q'(l) = 1. 

503 Thus, if individuals in the range expansion model have offspring variance v at the expansion front 

504 and variance v away from the front, the total variance after t time steps with d expansion events is 

505 (t — d)v + dv. 

506 Now from eq. (A. 19) we have (for / 0 = 1), 



EZ rf 



Var(X d ) - Var(A d ) 
[(dv) - (dv)] 



(A.25) 



so? Adding T generations with neutral drift between each founder event and r generations after the 
5os expansion stopped, changes this only to 

EZ d = - [(dv + (d- l)Tv + tv) - (dv + (d- l)Tv + tv)} 

509 which simplifies to eq. (A. 19). 

510 We can model more complex expansion models, such as an extended bottleneck or logistic growth 
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511 similarly. Again, this will result in an increase of Var(Xd) and Var(Xd) by the same amount, which 

512 cancels in the difference. 

so Furthermore, we can also change how we subdivide a population into demes. It is easy to see that 

514 a population with expansions at times 0,1,2,... and offspring variances Vai(F) and Vai(F) behaves 

515 similarly to a population with expansions occuring at times 0, St, 2St, . . . with offspring variances Var J' F ' 1 

516 and Var ^ F ) in the sense that EZ t will be the same for cither population. This suggests that it is not 

517 important how we subdivide space into demes, only the relative size of the founder population versus the 
sis neutral populations matters. Thus, it is most convenient to report the strength of the founder effect in 
sis units of "decrease in genetic diversity per unit of distance. 

520 A. 5 Estimation 

521 To estimate EZ t from genetic data, we need to take subsampling into account, i.e. we need to estimate 

522 EZ t = EY t — EY t . In particular, the probability that an allele got lost from a population is not the 

523 same as it being absent from a sample. To model subsampling, we assume we start with fo copies of the 

524 derived allele and Aq copies of the ancestral allele, all evolving as a independent branching processes. The 

525 expected number of ancestral alleles will be EA t = Aq in all generations, whereas the expected number 

526 of the derived allele, conditioned on it not being lost, is EY t . We Hence, in generation i, the probability 

527 of drawing m copies of the derived allele out of n samples is approximately binomially distributed with 
52s parameters n and KY ^+A 0 ' mean of the expected allele frequency, conditional of sampling at least 

529 one derived allele is 

~ . n nEY t If EY t \ n nEY t (A 0 +EY t ) n - 1 , A . 

EY t — = — / —r- = 7 ■ A.26 

11 2N EY t + Ao/ \EY t +A 0 J Aft V ' 

530 with the ^ term normalizing Y t to allele counts. Setting Aq ps 2Af — EYt we obtain the series represen- 

531 tation 

EY t =EY t + ^(EY t ) 2 + o^y (A.27) 

532 Hence, 



EZ t = EY t - EYt = EYt - EY t + 2 



((EY t ) 2 -(EYt) 2 )+o(^. (A.28) 
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533 and we see that we have a bias term that increases with sample size. Hence, the easiest way to proceed is 

534 to downsample larger samples to a sample size of two, the case that is arguably most important in light 

535 of genomic data. 

536 To compare samples of size n\ and ri2, from a site frequency spectrum S = fy, 0 < i < n\, 0 < j < n-x 

537 we can calculate a reduced site frequency spectrum matrix S' from the full site frequency spectrum using 



S' = PiSP^, (A.29) 



53s where Pi and P2 are (2 + 1) x (711 + 1) and (2 + 1) x (712 + 1) matrices (with indeces starting at 0), 
539 respectively with entries 




(A.30) 



540 for 0 < i < n\ and 0 < j < 2 for Pi. Entries in P2, are similar, except m is replaced by n^. 

541 If we denote the entries of S' with , we can write EZ t as 

EZtV- = 512 ~ 521 . (A.31) 
S12 + s 2 i + sn 

542 This statistic is identical to the ip statistic defined in Peter and Slatkin (2013), where we did not give 

543 any theoretical justification. 



region 


longitude 


latitude 


q 


ri 


rw 


noo 


X e 


r 2 


P 


Scandinavia 


24.16 


60.32 


0.00065 


0.99869 


0.9871 


0.884 


7.75 


0.252 


4.2e-55 


USA 


-78.63 


44.22 


0.00118 


0.99763 


0.9768 


0.808 


4.26 


0.242 


6.8e-06 


Central Europe 


11.40 


46.84 


0.00035 


0.99928 


0.9928 


0.932 


14.05 


0.171 


8.3e-21 


Western Europe 


2.47 


43.33 


0.00013 


0.99973 


0.9973 


0.974 


38.60 


0.264 


4.7e-50 


Eastern Range 


48.29 


46.92 


0.00028 


0.99943 


0.9943 


0.946 


17.80 


0.115 


3.6e-22 



Table 1: Analysis of A. thaliana data. 

The table shows the inferred latitude and longitude of the origin, q: regression slope in km -1 , ri = k e /N e , 
for demes of size ikm. di, distance (in km) over which 1 — k e /N e = 1%. r 2 and p: adjusted coefficient of 
determination and Bonfcrroni-corrccted p-valuc. 
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Figure 1: Schematic of the expansion models studied. This figure shows the basic process we study, 
each square corresponds to a subpopulation, with grey borders indicating subpopulations not colonized 
at a time step. Each time step, a new deme is colonized (black, dashed arrows), and other demes undergo 
neutral genetic drift (grey arrows). We compare the allele frequencies {X t } at the expansion origin d\ 
(dashed bordercs) with the allele frequencies {X t } at the expansion front (dark backgrounds). 
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Figure 2: This figure shows the expected allele frequency difference between denies compared 
with simulations. Note: Axes labels aren't done yet. X-axis: Dcmc; Y-axis: Allele count difference 
to demc 1. Top row: k = 0.1N, middle row: k = 0.5N, bottom row: k = 0.9N. first column: 
f Q = 1,N = 1000, second column: f 0 = 10, N = 1000, third column: f 0 = 10, N = 10000, Fourth 
column: /o = 100, = 10000, red line: prediction using branching process model, black, blue and 
lightblue dots correspond to samples right after expansion reached deme 100, 100 generations later and 
500 generations later, respectively. Other parameters are t = 2, m = 0 and 10 6 alleles were generated 
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Figure 3: Comparison between WF-simulations and predictions from the branching process 
model under a logistic growth model. Growth rates were set to 1 (Panel a) and 0.5 (Panel b), 
respectively the lines correspond to 1-10 generations of logistic growth per expansion step (from bottom 
to top). Dots correspond to the simulated data, and the dashed lines are the analytical predictions using 
the harmonic mean of the population sizes. 
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Figure 4: Effect of migration rate and subsampling. Each set of points corresponds to ip estimated 
from simulations under a specific k e value, k e varies from 100 to 500 in increments of 100 (top to bottom/ 
blue to green). Grey dashed lines give the expectation from the branching process model. Top row: data 
sampled immediately after the expansion finished. Bottom row: data sampled a very long time (100 
coalescence units) after the expansion finished. Other parameters are: sample size n = 10, time between 
expansion events t e = 0.0001 (in coalescence units). 
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Figure 5: Effect of a 2D-geography. Each set of points corresponds to ip estimated from simulations 
under a specific k e value, k e varies from 100 to 500 in increments of 100 (top to bottom/ blue to green). 
Grey dashed lines give the expectation from the branching process model in one dimension. 
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Figure 6: Results for A. thaliana data set. Panel a: PCA analysis of the 121,412 SNP. Colors: 
Green: Scandinavia. Black: Americas. Blue: UK. Cyan: France. Light blue: Span & Portugal. Red: 
North- Western Europe. Orange: Switzerland & Italy. Pink: Central Europe. Brown: Russia, Lithuania 
and Western Asia. Panels b-f: Expansion for Scandinavia, USA, Central Europe, Western Europe and 
British Isles, respectively. Brighter regions indicate more likely origin of expansion. 
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